Introduction¶

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

About the Dataset of Google Play Store Apps & Reviews¶

Data Source:
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. Original files listed here.

Import Statements¶

In [42]:
import pandas as pd
import plotly.express as px

Notebook Presentation¶

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

Read the Dataset¶

In [3]:
df_apps = pd.read_csv('apps.csv')

Data Cleaning¶

Challenge: How many rows and columns does df_apps have? What are the column names? Look at a random sample of 5 different rows with .sample().

In [4]:
df_apps.shape
Out[4]:
(10841, 12)
In [5]:
df_apps.columns
Out[5]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')
In [26]:
df_apps.sample(5)
Out[26]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
536 Lexus Tech BH Service Pro LIFESTYLE NaN 0 0.88 10 Free 0 Teen Lifestyle
1680 Catalyst AZ FAMILY 4.80 12 8.40 500 Free 0 Teen Education
2744 love sms good morning COMMUNICATION 4.20 10 3.10 5,000 Free 0 Everyone Communication
3864 Results for FL Lottery NEWS_AND_MAGAZINES 4.60 146 7.10 10,000 Free 0 Everyone News & Magazines
8069 Fast like a Fox GAME 4.30 104389 12.00 5,000,000 Free 0 Everyone Adventure

Drop Unused Columns¶

Challenge: Remove the columns called Last_Updated and Android_Version from the DataFrame. We will not use these columns.

In [13]:
df_apps.columns
Out[13]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres'],
      dtype='object')

Find and Remove NaN values in Ratings¶

Challenge: How may rows have a NaN value (not-a-number) in the Ratings column? Create DataFrame called df_apps_clean that does not include these rows.

In [19]:
df_apps['Rating'].isna().values.sum()
Out[19]:
1474
In [24]:
df_apps_clean = df_apps.dropna()
df_apps_clean.head()
print(df_apps_clean.shape)
(9367, 10)

Find and Remove Duplicates¶

Challenge: Are there any duplicates in data? Check for duplicates using the .duplicated() function. How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.

In [28]:
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
duplicated_rows.head()
(476, 10)
Out[28]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
946 420 BZ Budeze Delivery MEDICAL 5.00 2 11.00 100 Free 0 Mature 17+ Medical
1133 MouseMingle DATING 2.70 3 3.90 100 Free 0 Mature 17+ Dating
1196 Cardiac diagnosis (heart rate, arrhythmia) MEDICAL 4.40 8 6.50 100 Paid $12.99 Everyone Medical
1231 Sway Medical MEDICAL 5.00 3 22.00 100 Free 0 Everyone Medical
1247 Chat Kids - Chat Room For Kids DATING 4.70 6 4.90 100 Free 0 Mature 17+ Dating
In [29]:
df_apps_clean = df_apps_clean.drop_duplicates()
In [30]:
df_apps_clean.shape
Out[30]:
(8891, 10)
In [31]:
# need to specify the subset for identifying duplicates
df_apps_clean = df_apps_clean.drop_duplicates(subset = ['App', 'Type', 'Price'])
In [33]:
df_apps_clean.shape
Out[33]:
(8199, 10)

Find Highest Rated Apps¶

Challenge: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?

In [35]:
df_apps_clean['App'].loc[df_apps_clean['Rating'].idxmax()]
Out[35]:
'KBA-EZ Health Guide'
In [36]:
# Another method is to use sort_values
df_apps_clean.sort_values('Rating', ascending=False).head()
Out[36]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
21 KBA-EZ Health Guide MEDICAL 5.00 4 25.00 1 Free 0 Everyone Medical
1230 Sway Medical MEDICAL 5.00 3 22.00 100 Free 0 Everyone Medical
1227 AJ Men's Grooming LIFESTYLE 5.00 2 22.00 100 Free 0 Everyone Lifestyle
1224 FK Dedinje BGD SPORTS 5.00 36 2.60 100 Free 0 Everyone Sports
1223 CB VIDEO VISION PHOTOGRAPHY 5.00 13 2.60 100 Free 0 Everyone Photography

Find 5 Largest Apps in terms of Size (MBs)¶

Challenge: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store. Based on the data, do you think there could be limit in place or can developers make apps as large as they please?

In [37]:
df_apps_clean.sort_values('Size_MBs', ascending=False).head()
Out[37]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
9942 Talking Babsy Baby: Baby Games LIFESTYLE 4.00 140995 100.00 10,000,000 Free 0 Everyone Lifestyle;Pretend Play
10687 Hungry Shark Evolution GAME 4.50 6074334 100.00 100,000,000 Free 0 Teen Arcade
9943 Miami crime simulator GAME 4.00 254518 100.00 10,000,000 Free 0 Mature 17+ Action
9944 Gangster Town: Vice District FAMILY 4.30 65146 100.00 10,000,000 Free 0 Mature 17+ Simulation
3144 Vi Trainer HEALTH_AND_FITNESS 3.60 124 100.00 5,000 Free 0 Everyone Health & Fitness

Find the 5 App with Most Reviews¶

Challenge: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [38]:
df_apps_clean.sort_values('Reviews', ascending=False).head()
Out[38]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
10805 Facebook SOCIAL 4.10 78158306 5.30 1,000,000,000 Free 0 Teen Social
10785 WhatsApp Messenger COMMUNICATION 4.40 69119316 3.50 1,000,000,000 Free 0 Everyone Communication
10806 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social
10784 Messenger – Text and Video Chat for Free COMMUNICATION 4.00 56642847 3.50 1,000,000,000 Free 0 Everyone Communication
10650 Clash of Clans GAME 4.60 44891723 98.00 100,000,000 Free 0 Everyone 10+ Strategy

Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings¶

In [40]:
ratings = df_apps_clean.Content_Rating.value_counts()
ratings
Out[40]:
Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64
In [43]:
fig = px.pie(labels = ratings.index, values = ratings.values)
fig.show()
C:\Users\Khushi\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  return args["labels"][column]
In [44]:
fig = px.pie(labels=ratings.index,
values=ratings.values,
title="Content Rating",
names=ratings.index,
)
fig.update_traces(textposition='outside', textinfo='percent+label')
 
fig.show()
C:\Users\Khushi\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning:

Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.

In [45]:
fig = px.pie(labels=ratings.index,
values=ratings.values,
title="Content Rating",
names=ratings.index,
             hole = 0.6
)
fig.update_traces(textposition='outside', textinfo='percent+label')
 
fig.show()
C:\Users\Khushi\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning:

Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.

Numeric Type Conversion: Examine the Number of Installs¶

Challenge: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have make sure you remove non-numeric characters first.

In [47]:
df_apps_clean['Installs'].describe()
Out[47]:
count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object
In [48]:
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App', 'Installs']].groupby('Installs').count()
Out[48]:
App
Installs
1 3
5 9
10 69
50 56
100 303
500 199
1000 698
5000 425
10000 988
50000 457
100000 1096
500000 504
1000000 1417
5000000 607
10000000 933
50000000 202
100000000 189
500000000 24
1000000000 20

Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate¶

Let's examine the Price column more closely.

Challenge: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the df_apps_clean DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?

In [49]:
df_apps_clean.Price.describe()
Out[49]:
count     8199
unique      73
top          0
freq      7595
Name: Price, dtype: object
In [51]:
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "")
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
C:\Users\Khushi\AppData\Local\Temp\ipykernel_6792\96199048.py:1: FutureWarning:

The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.

The most expensive apps sub $250¶

In [55]:
df_apps_clean.sort_values('Price', ascending = False).head()
Out[55]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
3946 I'm Rich - Trump Edition LIFESTYLE 3.60 275 7.30 10000 Paid 400.00 Everyone Lifestyle
2461 I AM RICH PRO PLUS FINANCE 4.00 36 41.00 1000 Paid 399.99 Everyone Finance
4606 I Am Rich Premium FINANCE 4.10 1867 4.70 50000 Paid 399.99 Everyone Finance
3145 I am rich(premium) FINANCE 3.50 472 0.94 5000 Paid 399.99 Everyone Finance
3554 💎 I'm rich LIFESTYLE 3.80 718 26.00 10000 Paid 399.99 Everyone Lifestyle
In [56]:
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False).head(5)
Out[56]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
2281 Vargo Anesthesia Mega App MEDICAL 4.60 92 32.00 1000 Paid 79.99 Everyone Medical
1407 LTC AS Legal MEDICAL 4.00 6 1.30 100 Paid 39.99 Everyone Medical
2629 I am Rich Person LIFESTYLE 4.20 134 1.80 1000 Paid 37.99 Everyone Lifestyle
2481 A Manual of Acupuncture MEDICAL 3.50 214 68.00 1000 Paid 33.99 Everyone Medical
2463 PTA Content Master MEDICAL 4.20 64 41.00 1000 Paid 29.99 Everyone Medical

Highest Grossing Paid Apps (ballpark estimate)¶

In [57]:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]
C:\Users\Khushi\AppData\Local\Temp\ipykernel_6792\2703497668.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[57]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Revenue_Estimate
9220 Minecraft FAMILY 4.50 2376564 19.00 10000000 Paid 6.99 Everyone 10+ Arcade;Action & Adventure 69,900,000.00
8825 Hitman Sniper GAME 4.60 408292 29.00 10000000 Paid 0.99 Mature 17+ Action 9,900,000.00
7151 Grand Theft Auto: San Andreas GAME 4.40 348962 26.00 1000000 Paid 6.99 Mature 17+ Action 6,990,000.00
7477 Facetune - For Free PHOTOGRAPHY 4.40 49553 48.00 1000000 Paid 5.99 Everyone Photography 5,990,000.00
7977 Sleep as Android Unlock LIFESTYLE 4.50 23966 0.85 1000000 Paid 5.99 Everyone Lifestyle 5,990,000.00
6594 DraStic DS Emulator GAME 4.60 87766 12.00 1000000 Paid 4.99 Everyone Action 4,990,000.00
6082 Weather Live WEATHER 4.50 76593 4.75 500000 Paid 5.99 Everyone Weather 2,995,000.00
7954 Bloons TD 5 FAMILY 4.60 190086 94.00 1000000 Paid 2.99 Everyone Strategy 2,990,000.00
7633 Five Nights at Freddy's GAME 4.60 100805 50.00 1000000 Paid 2.99 Teen Action 2,990,000.00
6746 Card Wars - Adventure Time FAMILY 4.30 129603 23.00 1000000 Paid 2.99 Everyone 10+ Card;Action & Adventure 2,990,000.00

Plotly Bar Charts & Scatter Plots: Analysing App Categories¶

In [58]:
# Get the number of different categories
df_apps_clean.Category.nunique()
Out[58]:
33
In [60]:
#To calculate the number of apps per category
top10_category = df_apps_clean.Category.value_counts()[:10]
top10_category
Out[60]:
FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64
In [61]:
bar = px.bar(x = top10_category.index, # index = category name
             y = top10_category.values)
 
bar.show()
In [62]:
# Grouping all apps by category and sum the number of installations
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

Horizontal Bar Chart - Most Popular Categories (Highest Downloads)¶

In [63]:
h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h')
 
h_bar.show()
In [64]:
# Customizing
h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h',
               title='Category Popularity')
 
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

Category Concentration - Downloads vs. Competition¶

Challenge:

  • First, create a DataFrame that has the number of apps in one column and the number of installs in another:

  • Then use the plotly express examples from the documentation alongside the .scatter() API referenceto create scatter plot that looks like this.

Hint: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [65]:
cat_number = df_apps_clean.groupby('Category').agg({'App': pd.Series.count})
cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)
The dimensions of the DataFrame are: (33, 2)
Out[65]:
App Installs
Category
GAME 910 13858762717
COMMUNICATION 257 11039241530
TOOLS 719 8099724500
PRODUCTIVITY 301 5788070180
SOCIAL 203 5487841475
PHOTOGRAPHY 263 4649143130
FAMILY 1606 4437554490
VIDEO_PLAYERS 148 3916897200
TRAVEL_AND_LOCAL 187 2894859300
NEWS_AND_MAGAZINES 204 2369110650
ENTERTAINMENT 102 2113660000
BOOKS_AND_REFERENCE 169 1665791655
PERSONALIZATION 298 1532352930
SHOPPING 180 1400331540
HEALTH_AND_FITNESS 243 1134006220
SPORTS 260 1096431465
BUSINESS 262 692018120
LIFESTYLE 297 503611120
MAPS_AND_NAVIGATION 118 503267560
FINANCE 296 455249400
WEATHER 72 361096500
EDUCATION 118 352852000
FOOD_AND_DRINK 94 211677750
DATING 134 140912410
ART_AND_DESIGN 61 114233100
HOUSE_AND_HOME 62 97082000
AUTO_AND_VEHICLES 73 53129800
LIBRARIES_AND_DEMO 64 52083000
COMICS 54 44931100
MEDICAL 292 39162676
PARENTING 50 31116110
BEAUTY 42 26916200
EVENTS 45 15949410
In [66]:
scatter = px.scatter(cat_merged_df, # data
                    x='App', # column name
                    y='Installs',
                    title='Category Concentration',
                    size='App',
                    hover_name=cat_merged_df.index,
                    color='Installs')
 
scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))
 
scatter.show()

Extracting Nested Data from a Column¶

Challenge: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.

In [67]:
# Number of Genres?
len(df_apps_clean.Genres.unique())
Out[67]:
114
In [68]:
# Problem: Having multiple categories separated by ;
df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]
Out[68]:
Lifestyle;Pretend Play      1
Strategy;Education          1
Adventure;Education         1
Role Playing;Brain Games    1
Tools;Education             1
Name: Genres, dtype: int64
In [69]:
# Split the strings on the semi-colon and then .stack them.
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')
We now have a single column with shape: (8564,)
Number of genres: 53

Colour Scales in Plotly Charts - Competition in Genres¶

Challenge: Can you create this chart with the Series containing the genre data?

Try experimenting with the built in colour scales in Plotly. You can find a full list here.

  • Find a way to set the colour scale using the color_continuous_scale parameter.
  • Find a way to make the color axis disappear by using coloraxis_showscale.
In [70]:
bar = px.bar(x = num_genres.index[:15], # index = category name
             y = num_genres.values[:15], # count
             title='Top Genres',
             hover_name=num_genres.index[:15],
             color=num_genres.values[:15],
             color_continuous_scale='Agsunset')
 
bar.update_layout(xaxis_title='Genre',
yaxis_title='Number of Apps',
coloraxis_showscale=False)
 
bar.show()

Grouped Bar Charts: Free vs. Paid Apps per Category¶

In [71]:
#Split between Free and paid apps
df_apps_clean.Type.value_counts()
Out[71]:
Free    7595
Paid     589
Name: Type, dtype: int64
In [72]:
# group our data first by Category and then by Type.
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.head()
Out[72]:
Category Type App
0 ART_AND_DESIGN Free 58
1 ART_AND_DESIGN Paid 3
2 AUTO_AND_VEHICLES Free 72
3 AUTO_AND_VEHICLES Paid 1
4 BEAUTY Free 42

Challenge: Use the plotly express bar chart examples and the .bar() API reference to create this bar chart:

You'll want to use the df_free_vs_paid DataFrame that you created above that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the categoryorder to 'total descending' as outlined in the documentation here here.

In [73]:
g_bar = px.bar(df_free_vs_paid,
               x='Category',
               y='App',
               title='Free vs Paid Apps by Category',
               color='Type',
               barmode='group')
 
g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))
 
g_bar.show()

Plotly Box Plots: Lost Downloads for Paid Apps¶

Challenge: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the Box Plots Guide and the .box API reference to create the following chart.

In [74]:
box = px.box(df_apps_clean,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')
 
box.update_layout(yaxis=dict(type='log'))
 
box.show()

Plotly Box Plots: Revenue by App Category¶

Challenge: See if you can generate the chart below:

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [75]:
df_paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']
box = px.box(df_paid_apps, 
             x='Category', 
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')
 
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))
 
 
box.show()